I heard that this R course is useful for handling research data efficiently, so my interest was immediate, although it takes some time to become familiar with this programming language. I first learned that this course could be worthwhile for me through an acquaintance.
I am looking forward to challenging myself with something completely new and to finding tools for analyzing my own research data. More about RStudio can be found here
students2014 <- read.csv(file = "/Users/streetman/IODS-project/data/lrn14.csv", header = T, sep=",")
students2014 <- students2014[,-1]
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(students2014)
## [1] 166 7
The data is a subset of a larger survey dataset, which includes questions posed to students as well as some demographic variables, collected to study approaches to learning. This subset consists of 166 rows (observations) and 7 columns (variables): gender, Age, Attitude, deep (deep-learning questions), stra (strategic-learning questions), surf (surface-learning questions), and Points (exam points).
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(students2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
Based on the visualization we can see, for example, that female respondents (N=110) were almost twice as numerous as males, their age varied between 17 and 55, and the mean age was 25. Regarding correlations with Points, the strongest is with Attitude and the weakest is with deep.
## Fit a linear model
model_fit <- lm(Points ~ Attitude + stra + surf, data = students2014)
## Print out a summary of the model
summary(model_fit)
##
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
## Fit a linear model
model_fit <- lm(Points ~ Attitude + stra, data = students2014)
## Print out a summary of the model
summary(model_fit)
##
## Call:
## lm(formula = Points ~ Attitude + stra, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## Attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
## Fit a linear model
model_fit <- lm(Points ~ Attitude, data = students2014)
## Print out a summary of the model
summary(model_fit)
##
## Call:
## lm(formula = Points ~ Attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The target variable is Points. Of the chosen explanatory variables Attitude, stra and surf, stra and surf show no statistically significant relationship with the target variable (p = 0.117 and p = 0.466), so surf, the weaker of the two, was removed first.
In the refitted model, stra still shows no statistically significant relationship with the target variable (p = 0.089), so it was removed as well.
Finally, Attitude has a statistically significant relationship with Points (p < 0.001), with a multiple R-squared of 0.1906, meaning that this variable alone explains approximately 19% of the variation in the students' exam points.
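As a reminder of what multiple R-squared measures: in a simple one-predictor regression it equals the squared correlation between the predictor and the response. A minimal sketch (using R's built-in mtcars data rather than the course data):

```r
# In simple linear regression, R-squared equals the squared
# Pearson correlation between predictor and response.
m <- lm(mpg ~ wt, data = mtcars)
r2 <- summary(m)$r.squared
r2_from_cor <- cor(mtcars$mpg, mtcars$wt)^2
all.equal(r2, r2_from_cor)  # TRUE
```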
my_model2 <- lm(Points ~ Attitude, data = students2014)
plot(my_model2, which = c(1,2,5))
A residual is the difference between the observed and the fitted value for a given observation. "Residuals vs. Fitted": the assumption of constant error variance appears valid, since the points are spread evenly without any clear pattern. "Normal Q-Q": the points follow the reference line reasonably well, which supports the assumption that the residuals are normally distributed. "Residuals vs. Leverage": no single observation has an unusually high impact on the model.
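If a numerical check is wanted alongside the diagnostic plots, the normality of residuals can also be assessed with a Shapiro–Wilk test. This is a sketch on a toy model (built-in mtcars data), not part of the original analysis:

```r
# Fit a toy model and extract its residuals
m <- lm(mpg ~ wt, data = mtcars)
res <- residuals(m)
# Shapiro-Wilk test: a large p-value means no evidence
# against the normality of the residuals.
shapiro.test(res)
```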
getwd()
## [1] "/Users/streetman/IODS-project"
alc <- read.csv(file = "/Users/streetman/IODS-project/data/alc.csv", header = T, sep=",")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "nursery" "internet" "guardian"
## [16] "traveltime" "studytime" "failures" "schoolsup" "famsup"
## [21] "paid" "activities" "higher" "romantic" "famrel"
## [26] "freetime" "goout" "Dalc" "Walc" "health"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use"
The dataset includes the variables listed above, derived from questions posed to students in relation to their alcohol usage.
library(tidyr)
glimpse(alc)
## Observations: 382
## Variables: 36
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G…
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F,…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 1…
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3…
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T,…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3,…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3,…
## $ Mjob <fct> at_home, at_home, at_home, health, other, services, o…
## $ Fjob <fct> teacher, other, other, services, other, other, other,…
## $ reason <fct> course, course, other, home, home, reputation, home, …
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, y…
## $ guardian <fct> mother, father, mother, mother, father, mother, mothe…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3,…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2,…
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no…
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, y…
## $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no…
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, y…
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, …
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5,…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3,…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2,…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1,…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4,…
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3,…
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 1…
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, …
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12,…
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5…
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE…
These four variables were chosen:
1. age (numeric)
2. famsize (binary: less than or equal to 3, or greater than 3)
3. failures (numeric: 0–3, or 4 if more)
4. absences (numeric: 0 to 93)
Hypothesis: there is an association between high alcohol consumption and each of the chosen variables, examined one variable at a time. H0: there is no association between a chosen variable and high alcohol consumption. H1: there is a significant association between a chosen variable and high alcohol consumption.
summary(alc)
## X school sex age address famsize
## Min. : 1.00 GP:342 F:198 Min. :15.00 R: 81 GT3:278
## 1st Qu.: 96.25 MS: 40 M:184 1st Qu.:16.00 U:301 LE3:104
## Median :191.50 Median :17.00
## Mean :191.50 Mean :16.59
## 3rd Qu.:286.75 3rd Qu.:17.00
## Max. :382.00 Max. :22.00
## Pstatus Medu Fedu Mjob Fjob
## A: 38 Min. :0.000 Min. :0.000 at_home : 53 at_home : 16
## T:344 1st Qu.:2.000 1st Qu.:2.000 health : 33 health : 17
## Median :3.000 Median :3.000 other :138 other :211
## Mean :2.806 Mean :2.565 services: 96 services:107
## 3rd Qu.:4.000 3rd Qu.:4.000 teacher : 62 teacher : 31
## Max. :4.000 Max. :4.000
## reason nursery internet guardian traveltime
## course :140 no : 72 no : 58 father: 91 Min. :1.000
## home :110 yes:310 yes:324 mother:275 1st Qu.:1.000
## other : 34 other : 16 Median :1.000
## reputation: 98 Mean :1.448
## 3rd Qu.:2.000
## Max. :4.000
## studytime failures schoolsup famsup paid activities
## Min. :1.000 Min. :0.0000 no :331 no :144 no :205 no :181
## 1st Qu.:1.000 1st Qu.:0.0000 yes: 51 yes:238 yes:177 yes:201
## Median :2.000 Median :0.0000
## Mean :2.037 Mean :0.2016
## 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :3.0000
## higher romantic famrel freetime goout
## no : 18 no :261 Min. :1.000 Min. :1.00 Min. :1.000
## yes:364 yes:121 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:2.000
## Median :4.000 Median :3.00 Median :3.000
## Mean :3.937 Mean :3.22 Mean :3.113
## 3rd Qu.:5.000 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :5.000 Max. :5.00 Max. :5.000
## Dalc Walc health absences
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 0.0
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 1.0
## Median :1.000 Median :2.000 Median :4.000 Median : 3.0
## Mean :1.482 Mean :2.296 Mean :3.573 Mean : 4.5
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 6.0
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :45.0
## G1 G2 G3 alc_use
## Min. : 2.00 Min. : 4.00 Min. : 0.00 Min. :1.000
## 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:1.000
## Median :12.00 Median :12.00 Median :12.00 Median :1.500
## Mean :11.49 Mean :11.47 Mean :11.46 Mean :1.889
## 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :18.00 Max. :18.00 Max. :18.00 Max. :5.000
## high_use
## Mode :logical
## FALSE:268
## TRUE :114
##
##
##
For example, for the variable "absences" the mean is 4.5 and the maximum is 45. The range is wide, but since the 3rd quartile is 6.0, most values fall between 0 and 6.0.
library(tidyr); library(dplyr); library(ggplot2)
alc_selected <- select(alc, one_of(c("age", "famsize", "failures", "absences")))
gather(alc_selected) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
For example, for the variable "age" we can see that most participants are under 19 years old.
g1 <- ggplot(alc, aes(x = high_use, y = age))
g1 + geom_boxplot(aes(col=sex)) + ylab("age") + ggtitle("Students alcohol consumption and age")
Those with high alcohol consumption were mainly male, and their age range was wider than that of females in the same consumption category.
g1 <- ggplot(data = alc, aes(x = alc_use, fill = famsize))
g1 + geom_bar()
It seems that the higher the alcohol consumption, the more evenly the two family-size categories are represented, whereas at lower consumption levels the larger (GT3) families clearly dominate.
alc$failures <- as.factor(alc$failures)
g1 <- ggplot(alc, aes(x = high_use, fill = failures))
g1 + geom_bar()
Somewhat surprisingly, the number of failures is higher in the lower alcohol consumption group.
g1 <- ggplot(alc, aes(x = high_use, y = absences))
g1 + geom_boxplot(aes(col=sex)) + ylab("absences") + ggtitle("Students alcohol consumption and absences")
It seems that the more absences a student has, the more likely high alcohol consumption is, for both sexes.
## Fit a linear model
alc$failures <- as.numeric(alc$failures)
alc$famsize <- as.numeric(alc$famsize)
model_fit <- lm(high_use ~ age + famsize + failures + absences, data = alc)
## Print out a summary of the model
summary(model_fit)
##
## Call:
## lm(formula = high_use ~ age + famsize + failures + absences,
## data = alc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9850 -0.2883 -0.2132 0.5052 0.8587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.399649 0.327648 -1.220 0.22332
## age 0.023963 0.019765 1.212 0.22612
## famsize 0.075069 0.050854 1.476 0.14073
## failures 0.106463 0.039562 2.691 0.00744 **
## absences 0.017151 0.004184 4.099 5.09e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4419 on 377 degrees of freedom
## Multiple R-squared: 0.07954, Adjusted R-squared: 0.06977
## F-statistic: 8.144 on 4 and 377 DF, p-value: 2.625e-06
## Fit a linear model
alc$failures <- as.numeric(alc$failures)
model_fit <- lm(high_use ~ age + failures + absences, data = alc)
## Print out a summary of the model
summary(model_fit)
##
## Call:
## lm(formula = high_use ~ age + failures + absences, data = alc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0151 -0.2787 -0.2162 0.5361 0.8393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.308391 0.322265 -0.957 0.33920
## age 0.024304 0.019795 1.228 0.22029
## failures 0.104492 0.039601 2.639 0.00867 **
## absences 0.017366 0.004188 4.146 4.17e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4426 on 378 degrees of freedom
## Multiple R-squared: 0.07422, Adjusted R-squared: 0.06687
## F-statistic: 10.1 on 3 and 378 DF, p-value: 2.045e-06
## Fit a linear model
alc$failures <- as.numeric(alc$failures)
alc$famsize <- as.numeric(alc$famsize)
model_fit <- lm(high_use ~ failures + absences, data = alc)
## Print out a summary of the model
summary(model_fit)
##
## Call:
## lm(formula = high_use ~ failures + absences, data = alc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0035 -0.2801 -0.2127 0.5510 0.8053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.081634 0.054308 1.503 0.13363
## failures 0.113115 0.038999 2.900 0.00394 **
## absences 0.017973 0.004162 4.319 2.01e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4429 on 379 degrees of freedom
## Multiple R-squared: 0.07052, Adjusted R-squared: 0.06562
## F-statistic: 14.38 on 2 and 379 DF, p-value: 9.573e-07
The target variable is high_use (of alcohol). Of the chosen explanatory variables age, family size, failures and absences, age and family size show no statistically significant relationship with the target variable (p = 0.226 and p = 0.141), so family size was removed first.
In the refitted model, age still shows no statistically significant relationship with the target variable (p = 0.220), so it was removed as well.
Finally, failures and absences both have statistically significant relationships with high_use (p = 0.004 and p < 0.001), with a multiple R-squared of 0.07, meaning that these two variables explain approximately 7% of the variation in high_use among the participants.
### Fitting the logistic regression model
m <- glm(high_use ~ failures + absences, data = alc, family = "binomial")
### Compute odds ratios (OR)
OR <- coef(m) %>% exp
### Compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
### Printing the output
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1528951 0.08788853 0.2604241
## failures 1.6482547 1.14889905 2.3830880
## absences 1.0898257 1.04433337 1.1423489
For both variables, failures and absences, the odds ratios indicate higher odds of high_use (of alcohol), and neither confidence interval crosses 1 (failures: 1.15–2.38; absences: 1.04–1.14), so both associations are statistically significant. Of the two, absences has the narrower, more precise interval, so from the chosen variables it seems to be the one most robustly related to high_use (of alcohol).
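The odds ratios and approximate confidence intervals can also be computed by hand from the model coefficients, as exp of the estimate ± 1.96 standard errors (a Wald-type interval, which differs slightly from the profile-likelihood interval that `confint()` returns). A sketch on a toy logistic model with the built-in mtcars data:

```r
# Toy logistic regression: transmission type (am) vs fuel economy (mpg)
m <- glm(am ~ mpg, data = mtcars, family = "binomial")
est <- coef(summary(m))[, "Estimate"]
se  <- coef(summary(m))[, "Std. Error"]
# Exponentiate to move from log-odds to odds ratios
OR       <- exp(est)
CI_lower <- exp(est - 1.96 * se)  # Wald approximation
CI_upper <- exp(est + 1.96 * se)
cbind(OR, CI_lower, CI_upper)
```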
## Fit the model
m <- glm(high_use ~ absences, data = alc, family = "binomial")
## Predict() the probability of high_use
probabilities <- predict(m, type = "response")
## Add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
## Use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
## Tabulate the target variable versus the predictions
results <- table(high_use = alc$high_use, prediction = alc$prediction)
results
## prediction
## high_use FALSE TRUE
## FALSE 263 5
## TRUE 105 9
tab <- table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
tab
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68848168 0.01308901 0.70157068
## TRUE 0.27486911 0.02356021 0.29842932
## Sum 0.96335079 0.03664921 1.00000000
## Total proportion of inaccurately classified individuals
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2879581
Based on the logistic regression analysis, the variable "absences" showed the most robust relationship with high_use (of alcohol), and was therefore chosen here for testing the predictive power of the model.
The computed loss of 0.2879 indicates that approximately 29% of this model's predictions are incorrect and 71% are correct. Note, however, that the trivial strategy of predicting FALSE for everyone would misclassify about 30% (the share of high users), so the model is only a modest improvement over guessing.
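The loss function above generalizes to any fitted classifier that outputs probabilities. A self-contained sketch (toy model on the built-in mtcars data, not the alc data):

```r
# Proportion of observations where the predicted probability
# falls on the wrong side of 0.5
loss_func <- function(class, prob) {
  mean(abs(class - prob) > 0.5)
}
# Toy binary classifier: transmission type (am) from fuel economy (mpg)
m <- glm(am ~ mpg, data = mtcars, family = "binomial")
probs <- predict(m, type = "response")
loss_func(class = mtcars$am, prob = probs)
```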
install.packages("MASS")
install.packages("corrplot")
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(corrplot)
## corrplot 0.84 loaded
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The dataset has 506 rows (observations) and 14 columns (variables). The data concerns housing values in the suburbs of Boston. More about the variables can be found here
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
cor_matrix <- cor(Boston)
cor_matrix %>% round(2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# Visualizing the correlations
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
The crime rate seems to have its strongest positive correlations with the property tax rate (tax) and accessibility to radial highways (rad).
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
We scaled the data (standardized each variable to mean 0 and standard deviation 1) so that the variables can be compared on the same scale.
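`scale()` subtracts each column's mean and divides by its standard deviation, so every scaled variable ends up with mean 0 and standard deviation 1. A minimal sketch:

```r
x <- c(2, 4, 6, 8, 10)
# Manual standardization: (x - mean) / sd
manual <- (x - mean(x)) / sd(x)
# scale() does the same column-wise; drop the matrix attributes
scaled <- as.numeric(scale(x))
all.equal(manual, scaled)  # TRUE
```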
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low","med_low","med_high","high"), include.lowest = TRUE)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
nrow(boston_scaled)
## [1] 506
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
summary(test)
## zn indus chas
## Min. :-0.487240 Min. :-1.42220 Min. :-0.27233
## 1st Qu.:-0.487240 1st Qu.:-0.86210 1st Qu.:-0.27233
## Median :-0.487240 Median :-0.21089 Median :-0.27233
## Mean : 0.008159 Mean : 0.04924 Mean : 0.03646
## 3rd Qu.: 0.209513 3rd Qu.: 1.01499 3rd Qu.:-0.27233
## Max. : 3.586088 Max. : 2.42017 Max. : 3.66477
## nox rm age
## Min. :-1.42991 Min. :-3.44659 Min. :-2.222999
## 1st Qu.:-0.96024 1st Qu.:-0.58123 1st Qu.:-0.814417
## Median :-0.14407 Median :-0.07918 Median : 0.317068
## Mean :-0.03968 Mean : 0.06373 Mean : 0.009251
## 3rd Qu.: 0.59809 3rd Qu.: 0.60754 3rd Qu.: 0.897020
## Max. : 2.72965 Max. : 3.55153 Max. : 1.116390
## dis rad tax
## Min. :-1.17464 Min. :-0.981871 Min. :-1.31269
## 1st Qu.:-0.82567 1st Qu.:-0.637331 1st Qu.:-0.77572
## Median :-0.19537 Median :-0.522484 Median :-0.39895
## Mean : 0.03516 Mean : 0.003333 Mean : 0.02471
## 3rd Qu.: 0.55661 3rd Qu.: 1.659603 3rd Qu.: 1.52941
## Max. : 3.95660 Max. : 1.659603 Max. : 1.79642
## ptratio black lstat
## Min. :-2.51994 Min. :-3.72665 Min. :-1.52961
## 1st Qu.:-0.75315 1st Qu.: 0.22700 1st Qu.:-0.75067
## Median : 0.11292 Median : 0.39368 Median :-0.29170
## Mean :-0.03244 Mean : 0.09837 Mean :-0.01283
## 3rd Qu.: 0.80578 3rd Qu.: 0.44062 3rd Qu.: 0.63393
## Max. : 1.26768 Max. : 0.44062 Max. : 2.70785
## medv
## Min. :-1.66713
## 1st Qu.:-0.63148
## Median :-0.15035
## Mean : 0.05986
## 3rd Qu.: 0.39058
## Max. : 2.98650
We created a categorical crime-rate variable by cutting the scaled crime rate at the quantiles shown above, and split the data into training (80%) and test (20%) sets for fitting and evaluating the LDA (linear discriminant analysis) model.
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2376238 0.2574257 0.2500000 0.2549505
##
## Group means:
## zn indus chas nox rm
## low 1.0411511 -0.9586192 -0.108283225 -0.8662085 0.46064529
## med_low -0.1139205 -0.3422746 -0.007331936 -0.5788835 -0.15673511
## med_high -0.3836558 0.1766433 0.117482837 0.3997728 0.04318968
## high -0.4872402 1.0170891 -0.042983423 1.0391280 -0.37654422
## age dis rad tax ptratio
## low -0.8969392 0.8418206 -0.6863802 -0.7208332 -0.42693157
## med_low -0.3762243 0.3805142 -0.5523004 -0.5271986 -0.06295901
## med_high 0.4084842 -0.3704369 -0.4531215 -0.3411977 -0.29319053
## high 0.8061459 -0.8403926 1.6384176 1.5142626 0.78111358
## black lstat medv
## low 0.3726995 -0.79330536 0.53328674
## med_low 0.3206780 -0.15021277 -0.01213996
## med_high 0.1079413 0.01084318 0.13799712
## high -0.8744200 0.89313159 -0.67938128
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.06051850 0.8886752341 -0.95796121
## indus 0.13171351 -0.3876117205 0.33691394
## chas -0.02283016 -0.0040069457 0.12751008
## nox 0.33985474 -0.6145645311 -1.33837578
## rm 0.08064396 -0.0580078405 -0.14349424
## age 0.21750725 -0.3428648511 -0.18757385
## dis -0.02077333 -0.4596329599 0.31658896
## rad 3.57518072 0.9711249633 0.24593269
## tax 0.11780862 -0.1060161167 0.24891951
## ptratio 0.12453220 0.0350012204 -0.31904844
## black -0.05683481 -0.0004998502 0.07551261
## lstat 0.21038037 -0.2878402466 0.34365478
## medv 0.07644718 -0.5088738221 -0.25129282
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9543 0.0338 0.0119
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
Here we fitted the LDA model on the training dataset; the visualization shows clear separation between the crime-rate groups, with some overlap between them.
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 11 18 2 0
## med_low 5 7 10 0
## med_high 2 6 14 3
## high 0 0 0 24
Here we applied the fitted LDA model to the test dataset. The cross-tabulation indicates that the higher crime-rate classes are predicted better than the lower ones. Overall, not an optimal prediction model, but a fairly good one.
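The overall accuracy can be read off the cross-tabulation as the share of the diagonal. Using the counts printed above (note that the train/test split is random, so these exact numbers vary between runs):

```r
# Confusion matrix copied from the output above
# (rows = correct class, columns = predicted class)
conf <- matrix(c(11, 18,  2,  0,
                  5,  7, 10,  0,
                  2,  6, 14,  3,
                  0,  0,  0, 24),
               nrow = 4, byrow = TRUE)
# Accuracy = correctly classified / all observations
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # about 0.55
```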
data(Boston)
boston_scaled_2 <- scale(Boston)
str(boston_scaled_2)
## num [1:506, 1:14] -0.419 -0.417 -0.417 -0.416 -0.412 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:506] "1" "2" "3" "4" ...
## ..$ : chr [1:14] "crim" "zn" "indus" "chas" ...
## - attr(*, "scaled:center")= Named num [1:14] 3.6135 11.3636 11.1368 0.0692 0.5547 ...
## ..- attr(*, "names")= chr [1:14] "crim" "zn" "indus" "chas" ...
## - attr(*, "scaled:scale")= Named num [1:14] 8.602 23.322 6.86 0.254 0.116 ...
## ..- attr(*, "names")= chr [1:14] "crim" "zn" "indus" "chas" ...
dim(boston_scaled_2)
## [1] 506 14
dist_eu <- dist(boston_scaled_2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
dist_man <- dist(boston_scaled_2, method = "manhattan")
summary(dist_man)
km <- kmeans(boston_scaled_2, centers = 3)
pairs(boston_scaled_2, col = km$cluster)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled_2, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
km <- kmeans(boston_scaled_2, centers = 2)
pairs(boston_scaled_2, col = km$cluster)
Here we reloaded and re-clustered the data. Based on the total within-cluster sum of squares (twcss) curve above, two clusters is the optimal number, since that is where the curve drops most sharply.
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(GGally)
library(corrplot)
library(tidyr)
library(ggplot2)
getwd()
## [1] "/Users/streetman/IODS-project"
human <- read.csv(file = "/Users/streetman/IODS-project/data/human.csv", header = T, sep=",")
dim(human)
## [1] 155 9
str(human)
## 'data.frame': 155 obs. of 9 variables:
## $ X : Factor w/ 155 levels "Afghanistan",..: 105 6 134 41 101 54 67 149 27 102 ...
## $ Edu2.FM : num 97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : Factor w/ 155 levels "1,123","1,228",..: 132 109 124 112 113 111 103 123 108 93 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
## X Edu2.FM Labo.FM Life.Exp
## Afghanistan: 1 Min. : 0.90 Min. :0.1857 Min. :49.00
## Albania : 1 1st Qu.: 27.15 1st Qu.:0.5984 1st Qu.:66.30
## Algeria : 1 Median : 56.60 Median :0.7535 Median :74.20
## Argentina : 1 Mean : 55.37 Mean :0.7074 Mean :71.65
## Armenia : 1 3rd Qu.: 85.15 3rd Qu.:0.8535 3rd Qu.:77.25
## Australia : 1 Max. :100.00 Max. :1.0380 Max. :83.50
## (Other) :149
## Edu.Exp GNI Mat.Mor Ado.Birth
## Min. : 5.40 1,123 : 1 Min. : 1.0 Min. : 0.60
## 1st Qu.:11.25 1,228 : 1 1st Qu.: 11.5 1st Qu.: 12.65
## Median :13.50 1,428 : 1 Median : 49.0 Median : 33.60
## Mean :13.18 1,458 : 1 Mean : 149.1 Mean : 47.16
## 3rd Qu.:15.20 1,507 : 1 3rd Qu.: 190.0 3rd Qu.: 71.95
## Max. :20.20 1,583 : 1 Max. :1100.0 Max. :204.80
## (Other):149
## Parli.F
## Min. : 0.00
## 1st Qu.:12.40
## Median :19.30
## Mean :20.91
## 3rd Qu.:27.95
## Max. :57.50
##
The dataset comes from the United Nations Development Programme. The variables are, in respective order:
Ratio of females to males with at least secondary education (Edu2.FM), ratio of females to males in the labour force (Labo.FM), expected years of schooling (Edu.Exp), life expectancy at birth (Life.Exp), Gross National Income per capita (GNI), maternal mortality ratio (Mat.Mor), adolescent birth rate (Ado.Birth) and percentage of female representatives in parliament (Parli.F).
We can see, for example, that expected years of schooling is 5.40 years at minimum, with a mean of 13.18 years. Life expectancy at birth is 49 years at minimum, with a mean of approximately 72 years.
# note: as.integer() on a factor returns the factor level codes, not the
# GNI values themselves; a safer conversion would be
# as.numeric(gsub(",", "", as.character(human$GNI)))
human$GNI <- as.integer(human$GNI)
human_ <- human[,-1]
dim(human_)
## [1] 155 8
str(human_)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : int 132 109 124 112 113 111 103 123 108 93 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
ggpairs(human_)
cor(human_) %>% corrplot(method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
The ggpairs plot shows the distributions of the variables and their pairwise correlations.
The correlation matrix gives further supporting evidence of these relationships.
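To back the visual impression with numbers, the strongest pairwise correlations can be listed directly (a sketch assuming `human_` is defined as above):

```r
# keep each variable pair once, then rank by absolute correlation
cm <- cor(human_)
cm[lower.tri(cm, diag = TRUE)] <- NA
cors <- na.omit(as.data.frame(as.table(cm)))
cors[order(-abs(cors$Freq)), ][1:5, ]  # five strongest correlations
```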
pca_human <- prcomp(human_)
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 93.4 4.1 1.5 0.7 0.3 0.0 0.0 0.0
biplot(pca_human, choices = 1:2, cex=c(0.8, 1), col=c("grey40", "deeppink2"))
The table indicates that the vast majority (93.4 %) of the variance is explained by the first principal component (PC1), so the data could be summarized along that single dimension.
The biplot likewise shows that PC1 captures most of the variation between observations, with maternal mortality as the driving variable.
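The dominance of PC1 here is largely a scale artefact: without standardization, PCA is driven by the variables with the largest raw variances. This can be verified directly (assuming `human_` as above):

```r
# raw variances of the variables; Mat.Mor dwarfs the others,
# so it dominates the unstandardized principal components
sort(sapply(human_, var), decreasing = TRUE)
```

This is exactly why the analysis is repeated on standardized data below.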
human_std <- scale(human_)
summary(human_std)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :-1.76122 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.91250 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.03967 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.96275 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 1.44288 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-1.7154 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.8577 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median : 0.0000 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8577 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 1.7154 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
pca_human <- prcomp(human_std)
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 50.7 16.5 12.3 9.7 3.8 2.9 2.6 1.4
biplot(pca_human, choices = 1:2, cex=c(0.8, 1), col=c("grey40", "deeppink2"))
GNI, Edu2.FM, Edu.Exp and Life.Exp have negative weights on PC1, while Mat.Mor and Ado.Birth have positive weights on PC1. Labo.FM has the largest positive weight on PC2.
Variables related to e.g. women's education and life expectancy thus explain most of the differences between observations through their large weights on PC1, while the variable related to women's labour force participation contributes most to PC2.
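The statements about the weights can be read directly from the rotation (loading) matrix of the standardized PCA (a sketch assuming `pca_human` was fitted on `human_std` as above):

```r
# loadings of the original variables on the first two components;
# sign and magnitude show each variable's contribution
round(pca_human$rotation[, 1:2], 2)
```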
library(FactoMineR)
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
library(dplyr)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage = "quali")
The graph shows the associations between the categories. The dimensions are the percentages of variation they explain, with the respective values shown in brackets.
Dimension 1 and Dimension 2 explain 15.2 % and 14.2 % of the variation, respectively.
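The percentages shown on the axes come from the eigenvalues of the MCA, which can also be extracted from the fitted object (assuming `mca` as above):

```r
# eigenvalue, percentage of variance and cumulative percentage
# for the first two dimensions
round(mca$eig[1:2, ], 2)
```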
library(ggplot2)
library(tidyr)
library(dplyr)
ratsl <- read.csv("/Users/streetman/IODS-project/data/RATSL.csv", row.names = 1)
ratsl$Group <- factor(ratsl$Group)
ratsl$ID <- factor(ratsl$ID)
ratsl <- ratsl %>% group_by(Time) %>% mutate(stweight = (Weight - mean(Weight)) / sd(Weight)) %>% ungroup()
str(ratsl)
## Classes 'tbl_df', 'tbl' and 'data.frame': 176 obs. of 6 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Weight : int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stweight: num -1.001 -1.12 -0.961 -0.842 -0.882 ...
The data consist of 16 rats divided into 3 groups on different diets, with body weight measured repeatedly over a 64-day period.
ggplot(ratsl, aes(x = Time, y = Weight, linetype = ID)) + geom_line() + facet_grid(. ~ Group, labeller = label_both) + scale_linetype_manual(values = rep(1:6, times=4)) + theme(legend.position = "none") + scale_y_continuous(limits = c(min(ratsl$Weight), max(ratsl$Weight)))
The rats in all groups gain weight over time. Group 1 differs in that its rats are lighter at every time point; there is also clear variation between subjects within each group.
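The visual impression can be checked numerically by comparing the group weights at the first and last measurement occasions (a sketch assuming `ratsl` as above, with `dplyr` loaded):

```r
# group means and standard deviations at the first and last time point
ratsl %>%
  filter(Time %in% c(min(Time), max(Time))) %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), sd = sd(Weight)) %>%
  ungroup()
```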
n <- ratsl$Time %>% unique() %>% length()
# note: n here is the number of time points (11), not the group size,
# so the standard errors below are only approximate
ratssum <- ratsl %>% group_by(Group, Time) %>% summarise(mean = mean(Weight), se = ( sd(Weight) / sqrt(n) )) %>% ungroup()
glimpse(ratssum)
## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29…
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.…
## $ se <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939…
# note: par(mfrow) does not affect ggplot2 output; the two plots below are drawn separately
ggplot(ratssum, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
theme(legend.position = c(0.9,0.5)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
ggplot(ratssum, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.9,0.5)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
Over time, the mean weights increase in all three groups; the error bars in the second plot show the variability around the group means.
ggplot(ratsl, aes(x = factor(Time), y = Weight, fill = Group)) + geom_boxplot()
rat8s <- ratsl %>%
filter(Time > 0) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
glimpse(rat8s)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273…
ggplot(rat8s, aes(x=Group, y = mean)) + geom_boxplot() + stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white")
glimpse(rat8s)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273…
rat8s1 <- filter(rat8s, Group == 2 & mean < 550 | Group == 1 & mean > 260 | Group == 3 & mean > 500)
glimpse(rat8s1)
## Observations: 13
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3
## $ ID <fct> 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16
## $ mean <dbl> 261.0909, 260.1818, 266.5455, 269.4545, 274.7273, 274.6364…
ggplot(rat8s1, aes(x=Group, y = mean)) + geom_boxplot() + stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white")
summary(aov(mean ~ Group, data = rat8s1))
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 175958 87979 2577 2.72e-14 ***
## Residuals 10 341 34
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using one-way ANOVA (rather than a t-test, which only compares two groups) with the outliers excluded, the difference between the group means is highly significant (p < 0.001).
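Since the ANOVA indicates an overall group difference, a post hoc comparison shows which pairs of groups differ (a sketch assuming `rat8s1` as above):

```r
# pairwise group comparisons with family-wise error control
TukeyHSD(aov(mean ~ Group, data = rat8s1))
```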
kruskal.test(mean ~ Group, data = rat8s1)
##
## Kruskal-Wallis rank sum test
##
## data: mean by Group
## Kruskal-Wallis chi-squared = 9.8901, df = 2, p-value = 0.007119
The non-parametric Kruskal-Wallis test also indicates a significant difference between the groups (p ≈ 0.007).
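A non-parametric follow-up is a pairwise Wilcoxon test with a multiplicity correction (assuming `rat8s1` as above):

```r
# pairwise rank-based comparisons with Holm-adjusted p-values
pairwise.wilcox.test(rat8s1$mean, rat8s1$Group, p.adjust.method = "holm")
```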
rats <- read.csv("/Users/streetman/IODS-project/data/rats.csv", row.names = 1)
rat8s2 <- rat8s %>% mutate(baseline = rats$WD1)
glimpse(rat8s2)
## Observations: 16
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7…
## $ baseline <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, …
fit <- lm(mean ~ baseline + Group, data = rat8s2)
summary(fit)
##
## Call:
## lm(formula = mean ~ baseline + Group, data = rat8s2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.732 -3.812 1.991 6.889 13.455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.14886 19.88779 1.516 0.1554
## baseline 0.93194 0.07793 11.959 5.02e-08 ***
## Group2 31.68866 17.11189 1.852 0.0888 .
## Group3 21.52296 21.13931 1.018 0.3287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.62 on 12 degrees of freedom
## Multiple R-squared: 0.9947, Adjusted R-squared: 0.9933
## F-statistic: 747.8 on 3 and 12 DF, p-value: 6.636e-14
The baseline weight is strongly related to the later mean weights; after controlling for the baseline, however, there is no significant difference between the group means.
bprsl <- read.csv(file = "/Users/streetman/IODS-project/data/bprsl.csv", row.names = 1)
bprsl$treatment <- factor(bprsl$treatment)
bprsl$subject <- factor(bprsl$subject)
glimpse(bprsl)
## Observations: 360
## Variables: 4
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ weeks <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38…
ggplot(bprsl, aes(x = weeks, y = bprs, group = subject)) + geom_line() + facet_grid(. ~ treatment, labeller = label_both)
BPRS scores decrease over time for nearly all individuals in both treatment groups.
bps_reg <- lm(bprs ~ weeks + treatment, data=bprsl)
summary(bps_reg)
##
## Call:
## lm(formula = bprs ~ weeks + treatment, data = bprsl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## weeks -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
A linear regression of BPRS on week and treatment group indicates a significant decrease over time, while the treatment variable is not significant. Note, however, that this model ignores the within-subject correlation of the repeated measurements.
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
bps_ref <- lmer(bprs ~ weeks + treatment + (1|subject), data = bprsl, REML=FALSE)
summary(bps_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ weeks + treatment + (1 | subject)
## Data: bprsl
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## weeks -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) weeks
## weeks -0.437
## treatment2 -0.282 0.000
This random intercept model allows each subject its own baseline level, but it still assumes the same rate of change for every subject; within-subject slope variation is not yet included.
bps_ref2 <- lmer(bprs ~ weeks + treatment + (weeks|subject), data = bprsl, REML=FALSE)
summary(bps_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ weeks + treatment + (weeks | subject)
## Data: bprsl
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7977
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## weeks 0.9609 0.9803 -0.51
## Residual 97.4304 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## weeks -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) weeks
## weeks -0.582
## treatment2 -0.247 0.000
anova(bps_ref, bps_ref2)
## Data: bprsl
## Models:
## bps_ref: bprs ~ weeks + treatment + (1 | subject)
## bps_ref2: bprs ~ weeks + treatment + (weeks | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## bps_ref 5 2748.7 2768.1 -1369.4 2738.7
## bps_ref2 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The likelihood ratio test (χ² = 7.27, p ≈ 0.026) favours the random intercept and slope model, so model 2 is better.
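Since lmer does not print p-values for the fixed effects, profile likelihood confidence intervals give a quick significance check for the chosen model (a sketch assuming `bps_ref2` as above; computing the profiles can take a moment):

```r
# 95 % profile likelihood confidence intervals for the fixed effects only;
# an interval excluding zero suggests a significant effect
confint(bps_ref2, parm = "beta_", method = "profile")
```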
bps_ref3 <- lmer(bprs ~ weeks * treatment + (weeks|subject), data = bprsl, REML=FALSE)
summary(bps_ref3)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ weeks * treatment + (weeks | subject)
## Data: bprsl
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## weeks 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## weeks -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## weeks:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) weeks trtmn2
## weeks -0.650
## treatment2 -0.424 0.469
## wks:trtmnt2 0.356 -0.559 -0.840
The interaction model suggests that BPRS decreases over time in both groups, but that the estimated decline is less steep for treatment group 2, since the interaction term adds about 0.72 points per week to its slope.
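The group-specific slopes implied by the interaction model can be computed directly from the fixed effects (a sketch assuming `bps_ref3` as above):

```r
# weekly slope per treatment group: group 1 uses the main effect,
# group 2 adds the interaction term
fe <- fixef(bps_ref3)
c(group1 = unname(fe["weeks"]),
  group2 = unname(fe["weeks"] + fe["weeks:treatment2"]))
# with the estimates above: roughly -2.63 vs -1.91 points per week
```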
anova(bps_ref2, bps_ref3)
## Data: bprsl
## Models:
## bps_ref2: bprs ~ weeks + treatment + (weeks | subject)
## bps_ref3: bprs ~ weeks * treatment + (weeks | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## bps_ref2 7 2745.4 2772.6 -1365.7 2731.4
## bps_ref3 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The likelihood ratio test (p ≈ 0.075) does not show a significant improvement, so the interaction model is not clearly better than model 2.
Fitted <- fitted(bps_ref2)
bprsl <- bprsl %>% mutate(Fitted)
p1 <- ggplot(bprsl, aes(x = weeks, y = bprs, group = subject)) + geom_line() + facet_grid(. ~ treatment, labeller = label_both) + scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) + ylim(0,100) + theme(legend.position = "top") + ggtitle("Observed")
p2 <- ggplot(bprsl, aes(x = weeks, y = Fitted, group = subject)) +
geom_line() + facet_grid(. ~ treatment, labeller = label_both) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
ylim(0, 100) +
theme(legend.position = "top") + ggtitle("Fitted")
p1
p2
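A simple numeric companion to the observed-versus-fitted plots is the root mean squared difference between the observed and fitted values (assuming `bprsl` with the `Fitted` column as above):

```r
# overall RMSE of the random intercept and slope model;
# smaller values mean the fitted profiles track the data more closely
sqrt(mean((bprsl$bprs - bprsl$Fitted)^2))
```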